{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 拼写检查器原理 ###\n",
    "在所有正确的拼写词中, 我们想要找一个正确的词 c, 使得对于 w 的条件概率最大。求解：  \n",
    "P(c|w) ->  P(w|c) P(c) / P(w)   \n",
    "比如：appla是条件w，apple和apply是正确的词c，对于apple和apply来说P(w)都是一样的，所以我们在上式中忽略它, 写成:    \n",
    "P(w|c) P(c)\n",
    "* P(c), 文章中出现这个正确拼写的词 c 的概率, 也就是说, 在英语文章中, c 出现的概率有多大。  \n",
    "假设可以认为单词在文章中出现的概率越大，则正确拼写的概率就越大，可以用单词出现次数来代替这个量。好比说, 英语中出现 the 的概率  P('the') 就相对高, 而出现  P('zxzxzxzyy') 的概率接近0(假设后者也是一个词的话).\n",
    "* P(w|c), 在用户想键入 c 的情况下敲成 w 的概率。这个是代表用户会以多大的概率把 c 敲错成 w。  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 读取内容\n",
    "text = open('big.txt').read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 转小写，只保留a-z字符\n",
    "text = re.findall('[a-z]+', text.lower())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dic_words = {}\n",
    "for t in text:\n",
    "    dic_words[t] = dic_words.get(t,0) + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'submitted': 27,\n",
       " 'thiers': 13,\n",
       " 'communistic': 4,\n",
       " 'municipality': 4,\n",
       " 'annals': 5,\n",
       " 'realizes': 1,\n",
       " 'coachmen': 6,\n",
       " 'tantalizing': 1,\n",
       " 'all': 4144,\n",
       " 'twot': 1,\n",
       " 'charcot': 19,\n",
       " 'vicar': 2,\n",
       " 'unworthy': 13,\n",
       " 'begrudge': 1,\n",
       " 'predominates': 3,\n",
       " 'consumption': 4,\n",
       " 'inmate': 1,\n",
       " 'dolefully': 1,\n",
       " 'sportsman': 6,\n",
       " 'episcopalian': 2,\n",
       " 'assault': 13,\n",
       " 'befalls': 1,\n",
       " 'papillomas': 2,\n",
       " 'fashion': 49,\n",
       " 'trustfulness': 1,\n",
       " 'toils': 4,\n",
       " 'longtemps': 1,\n",
       " 'colostomy': 1,\n",
       " 'constructionists': 1,\n",
       " 'overstretching': 2,\n",
       " 'clumsy': 8,\n",
       " 'passport': 4,\n",
       " 'gauzy': 1,\n",
       " 'furtively': 2,\n",
       " 'verrons': 1,\n",
       " 'contused': 16,\n",
       " 'conducts': 1,\n",
       " 'viceroy': 1,\n",
       " 'avoiding': 22,\n",
       " 'locus': 1,\n",
       " 'dorogomilov': 9,\n",
       " 'nondescript': 1,\n",
       " 'shortened': 6,\n",
       " 'carefree': 2,\n",
       " 'caskets': 1,\n",
       " 'bronchitis': 1,\n",
       " 'calculated': 19,\n",
       " 'rate': 67,\n",
       " 'save': 110,\n",
       " 'smacking': 2,\n",
       " 'predispose': 2,\n",
       " 'frontiersman': 2,\n",
       " 'comment': 10,\n",
       " 'haemorrhages': 14,\n",
       " 'discover': 28,\n",
       " 'gazed': 94,\n",
       " 'administrative': 12,\n",
       " 'momburg': 1,\n",
       " 'trestle': 1,\n",
       " 'hereditary': 14,\n",
       " 'mintage': 1,\n",
       " 'tecumseh': 2,\n",
       " 'memorials': 3,\n",
       " 'desirability': 2,\n",
       " 'ruffian': 3,\n",
       " 'wastes': 2,\n",
       " 'nieces': 1,\n",
       " 'satyr': 1,\n",
       " 'apply': 43,\n",
       " 'gridneva': 1,\n",
       " 'liabilities': 1,\n",
       " 'ruza': 1,\n",
       " 'dilemma': 4,\n",
       " 'meshkov': 1,\n",
       " 'dmitrov': 1,\n",
       " 'rugayushka': 2,\n",
       " 'evaporating': 5,\n",
       " 'inferior': 18,\n",
       " 'adopting': 7,\n",
       " 'mass': 101,\n",
       " 'aqueous': 1,\n",
       " 'inert': 4,\n",
       " 'slits': 1,\n",
       " 'polonaise': 5,\n",
       " 'middle': 192,\n",
       " 'saute': 3,\n",
       " 'according': 164,\n",
       " 'ft': 2,\n",
       " 'gillies': 2,\n",
       " 'void': 21,\n",
       " 'sylvis': 1,\n",
       " 'obstructed': 10,\n",
       " 'spirituous': 1,\n",
       " 'awaiting': 53,\n",
       " 'splendidly': 11,\n",
       " 'hat': 105,\n",
       " 'foreclosing': 1,\n",
       " 'flirt': 4,\n",
       " 'untying': 1,\n",
       " 'vassal': 1,\n",
       " 'instant': 101,\n",
       " 'active': 96,\n",
       " 'mourn': 1,\n",
       " 'oliver': 2,\n",
       " 'rights': 168,\n",
       " 'intrigue': 11,\n",
       " 'preparing': 50,\n",
       " 'waitresses': 1,\n",
       " 'stupidest': 3,\n",
       " 'boyhood': 3,\n",
       " 'mortal': 14,\n",
       " 'insurance': 7,\n",
       " 'warsaw': 6,\n",
       " 'sheaves': 3,\n",
       " 'rejected': 35,\n",
       " 'journalist': 4,\n",
       " 'interference': 80,\n",
       " 'bark': 9,\n",
       " 'tranquil': 21,\n",
       " 'nonperformance': 1,\n",
       " 'tunes': 1,\n",
       " 'dolorosa': 3,\n",
       " 'shorten': 2,\n",
       " 'intercepted': 1,\n",
       " 'wetu': 1,\n",
       " 'buzzed': 1,\n",
       " 'confession': 14,\n",
       " 'charmed': 4,\n",
       " 'disk': 10,\n",
       " 'pall': 2,\n",
       " 'mycetoma': 7,\n",
       " 'disbelieve': 2,\n",
       " 'extirpated': 1,\n",
       " 'started': 96,\n",
       " 'alexis': 9,\n",
       " 'brushes': 1,\n",
       " 'cherishing': 2,\n",
       " 'april': 33,\n",
       " 'sanctioned': 4,\n",
       " 'maidens': 1,\n",
       " 'congress': 445,\n",
       " 'planted': 16,\n",
       " 'woolen': 15,\n",
       " 'de': 237,\n",
       " 'buoyed': 1,\n",
       " 'sensations': 9,\n",
       " 'harboring': 2,\n",
       " 'galoshes': 2,\n",
       " 'becomes': 245,\n",
       " 'kinds': 29,\n",
       " 'degenerates': 2,\n",
       " 'concerned': 80,\n",
       " 'tshausen': 1,\n",
       " 'drunkards': 1,\n",
       " 'volkmann': 5,\n",
       " 'provost': 1,\n",
       " 'tamed': 1,\n",
       " 'tilde': 3,\n",
       " 'consults': 1,\n",
       " 'trifles': 16,\n",
       " 'haemoptysis': 2,\n",
       " 'barbarity': 2,\n",
       " 'rod': 10,\n",
       " 'analogy': 5,\n",
       " 'thebes': 1,\n",
       " 'lived': 113,\n",
       " 'seminar': 1,\n",
       " 'martinet': 1,\n",
       " 'stimulated': 15,\n",
       " 'concrete': 4,\n",
       " 'cheyenne': 1,\n",
       " 'asks': 21,\n",
       " 'fixes': 1,\n",
       " 'fixing': 18,\n",
       " 'preopinant': 1,\n",
       " 'adornment': 1,\n",
       " 'panics': 3,\n",
       " 'rumpled': 1,\n",
       " 'rattle': 17,\n",
       " 'nonchalance': 1,\n",
       " 'listerian': 3,\n",
       " 'pashette': 1,\n",
       " 'sworn': 6,\n",
       " 'tribe': 4,\n",
       " 'captains': 12,\n",
       " 'contention': 6,\n",
       " 'unwontedly': 1,\n",
       " 'stars': 31,\n",
       " 'navigators': 1,\n",
       " 'province': 48,\n",
       " 'mutation': 1,\n",
       " 'loffler': 4,\n",
       " 'onset': 37,\n",
       " 'heralds': 1,\n",
       " 'dakotans': 1,\n",
       " 'certify': 2,\n",
       " 'los': 3,\n",
       " 'chanters': 2,\n",
       " 'relays': 4,\n",
       " 'ability': 14,\n",
       " 'ministry': 14,\n",
       " 'freed': 16,\n",
       " 'sounder': 2,\n",
       " 'gavril': 1,\n",
       " 'enriched': 3,\n",
       " 'victoria': 6,\n",
       " 'provided': 88,\n",
       " 'eyed': 20,\n",
       " 'chevalier': 4,\n",
       " 'tearworn': 1,\n",
       " 'azov': 1,\n",
       " 'bolshevists': 1,\n",
       " 'ruefully': 3,\n",
       " 'losing': 43,\n",
       " 'improving': 13,\n",
       " 'morton': 1,\n",
       " 'varix': 33,\n",
       " 'protects': 5,\n",
       " 'napkins': 1,\n",
       " 'complaining': 6,\n",
       " 'commune': 7,\n",
       " 'regiment': 230,\n",
       " 'propounding': 1,\n",
       " 'channing': 5,\n",
       " 'seizes': 4,\n",
       " 'smuggling': 3,\n",
       " 'separates': 15,\n",
       " 'woodburn': 1,\n",
       " 'powdery': 1,\n",
       " 'mirrors': 10,\n",
       " 'audibly': 3,\n",
       " 'endeavored': 2,\n",
       " 'moments': 56,\n",
       " 'profitability': 1,\n",
       " 'popped': 3,\n",
       " 'provinces': 42,\n",
       " 'strokes': 4,\n",
       " 'errand': 7,\n",
       " 'doings': 11,\n",
       " 'silhouette': 1,\n",
       " 'substitutions': 1,\n",
       " 'ramballe': 19,\n",
       " 'swabbed': 1,\n",
       " 'sorrows': 6,\n",
       " 'obliquity': 1,\n",
       " 'recruit': 5,\n",
       " 'disclaimer': 9,\n",
       " 'weaknesses': 6,\n",
       " 'ventured': 30,\n",
       " 'haemostatics': 3,\n",
       " 'aspen': 3,\n",
       " 'aesthetic': 2,\n",
       " 'measured': 25,\n",
       " 'father': 533,\n",
       " 'pleadingly': 1,\n",
       " 'rotary': 1,\n",
       " 'hustling': 1,\n",
       " 'plausible': 6,\n",
       " 'begged': 34,\n",
       " 'rabble': 7,\n",
       " 'roan': 2,\n",
       " 'computers': 7,\n",
       " 'vacancies': 13,\n",
       " 'speculator': 5,\n",
       " 'articular': 118,\n",
       " 'brunswick': 1,\n",
       " 'holding': 144,\n",
       " 'shyness': 5,\n",
       " 'proclaimed': 17,\n",
       " 'intellect': 18,\n",
       " 'communicable': 1,\n",
       " 'petty': 17,\n",
       " 'breaks': 17,\n",
       " 'fellow': 233,\n",
       " 'passes': 22,\n",
       " 'gut': 14,\n",
       " 'deduced': 12,\n",
       " 'tomfoolery': 1,\n",
       " 'garfield': 9,\n",
       " 'staples': 10,\n",
       " 'diverse': 18,\n",
       " 'devise': 9,\n",
       " 'english': 211,\n",
       " 'indefinable': 2,\n",
       " 'thrilling': 6,\n",
       " 'fox': 23,\n",
       " 'scullions': 1,\n",
       " 'indiscretion': 4,\n",
       " 'barbara': 5,\n",
       " 'chalice': 1,\n",
       " 'engrossing': 1,\n",
       " 'appositeness': 1,\n",
       " 'uselessness': 2,\n",
       " 'weighted': 2,\n",
       " 'keyhole': 2,\n",
       " 'grown': 100,\n",
       " 'softness': 7,\n",
       " 'intolerably': 2,\n",
       " 'nerves': 149,\n",
       " 'chord': 6,\n",
       " 'taverns': 4,\n",
       " 'whip': 40,\n",
       " 'beck': 3,\n",
       " 'believed': 89,\n",
       " 'kobelnitz': 3,\n",
       " 'tilt': 1,\n",
       " 'kensington': 1,\n",
       " 'gladdening': 2,\n",
       " 'janet': 1,\n",
       " 'cult': 1,\n",
       " 'policy': 116,\n",
       " 'echinoccus': 1,\n",
       " 'privat': 1,\n",
       " 'fix': 27,\n",
       " 'persia': 2,\n",
       " 'ducrey': 5,\n",
       " 'uncomplainingly': 1,\n",
       " 'vanquishing': 1,\n",
       " 'lasted': 31,\n",
       " 'comite': 1,\n",
       " 'napoleons': 4,\n",
       " 'foxy': 1,\n",
       " 'waiting': 183,\n",
       " 'descriptions': 8,\n",
       " 'bazdeev': 10,\n",
       " 'moravians': 2,\n",
       " 'adventurers': 6,\n",
       " 'advisable': 15,\n",
       " 'syllable': 7,\n",
       " 'rotten': 8,\n",
       " 'commotions': 1,\n",
       " 'accountable': 1,\n",
       " 'joan': 1,\n",
       " 'emaciation': 4,\n",
       " 'stake': 24,\n",
       " 'scabbards': 1,\n",
       " 'serajevo': 1,\n",
       " 'naphthol': 1,\n",
       " 'leisurely': 9,\n",
       " 'deputies': 2,\n",
       " 'labels': 3,\n",
       " 'unclothed': 1,\n",
       " 'tower': 9,\n",
       " 'pat': 3,\n",
       " 'hardened': 10,\n",
       " 'anton': 3,\n",
       " 'sadness': 16,\n",
       " 'quitted': 7,\n",
       " 'rafts': 1,\n",
       " 'alienated': 3,\n",
       " 'control': 122,\n",
       " 'rapt': 3,\n",
       " 'ascent': 3,\n",
       " 'perspiration': 13,\n",
       " 'petite': 1,\n",
       " 'alicia': 1,\n",
       " 'mansion': 5,\n",
       " 'slaveholding': 2,\n",
       " 'infernal': 2,\n",
       " 's': 5631,\n",
       " 'paddle': 3,\n",
       " 'tipsy': 13,\n",
       " 'soles': 4,\n",
       " 'wife': 367,\n",
       " 'appeared': 197,\n",
       " 'indianians': 1,\n",
       " 'ventral': 1,\n",
       " 'matins': 2,\n",
       " 'incised': 23,\n",
       " 'neuralgia': 19,\n",
       " 'infraction': 1,\n",
       " 'investigated': 3,\n",
       " 'continually': 91,\n",
       " 'fatty': 23,\n",
       " 'thermogene': 1,\n",
       " 'reverberating': 1,\n",
       " 'impunity': 5,\n",
       " 'pollard': 2,\n",
       " 'beggars': 3,\n",
       " 'unchanging': 7,\n",
       " 'opened': 217,\n",
       " 'reannexation': 2,\n",
       " 'rhythmical': 2,\n",
       " 'kolyazin': 3,\n",
       " 'reaper': 2,\n",
       " 'scrubbed': 1,\n",
       " 'undisturbed': 13,\n",
       " 'defiantly': 1,\n",
       " 'hyperplastic': 1,\n",
       " 'warring': 4,\n",
       " 'worrying': 8,\n",
       " 'adducing': 1,\n",
       " 'directors': 8,\n",
       " 'mounseer': 1,\n",
       " 'thatched': 2,\n",
       " 'collodion': 11,\n",
       " 'musical': 9,\n",
       " 'fibrositis': 16,\n",
       " 'imperturbably': 1,\n",
       " 'rushes': 5,\n",
       " 'parallels': 1,\n",
       " 'borrow': 6,\n",
       " 'heroes': 25,\n",
       " 'sandy': 3,\n",
       " 'thatch': 1,\n",
       " 'patroness': 2,\n",
       " 'alexey': 1,\n",
       " 'disillusion': 1,\n",
       " 'germans': 64,\n",
       " 'subsistence': 3,\n",
       " 'grenville': 10,\n",
       " 'standstill': 2,\n",
       " 'nephroma': 1,\n",
       " 'repair': 78,\n",
       " 'decks': 1,\n",
       " 'undetermined': 1,\n",
       " 'dig': 5,\n",
       " 'retaken': 2,\n",
       " 'mistakes': 15,\n",
       " 'doubted': 10,\n",
       " 'anxieties': 2,\n",
       " 'serpentine': 8,\n",
       " 'urge': 10,\n",
       " 'informing': 11,\n",
       " 'plainness': 7,\n",
       " 'costa': 1,\n",
       " 'belgians': 1,\n",
       " 'geneva': 5,\n",
       " 'flush': 12,\n",
       " 'mandibular': 5,\n",
       " 'tenders': 1,\n",
       " 'enfants': 2,\n",
       " 'swam': 2,\n",
       " 'unprecedented': 5,\n",
       " 'fundus': 1,\n",
       " 'wednesday': 13,\n",
       " 'remorselessly': 1,\n",
       " 'rotation': 7,\n",
       " 'hydatids': 7,\n",
       " 'conserve': 4,\n",
       " 'cheadle': 1,\n",
       " 'hooped': 1,\n",
       " 'kicks': 1,\n",
       " 'mather': 1,\n",
       " 'rupia': 4,\n",
       " 'addresses': 10,\n",
       " 'railway': 73,\n",
       " 'soar': 1,\n",
       " 'moan': 10,\n",
       " 'returning': 68,\n",
       " 'magnesium': 2,\n",
       " 'concluded': 57,\n",
       " 'unexposed': 1,\n",
       " 'controversies': 14,\n",
       " 'stein': 8,\n",
       " 'muskets': 22,\n",
       " 'dichotomy': 1,\n",
       " 'treasures': 2,\n",
       " 'peter': 52,\n",
       " 'gase': 1,\n",
       " 'constituted': 15,\n",
       " 'kissing': 38,\n",
       " 'shamefacedly': 1,\n",
       " 'mio': 2,\n",
       " 'frieze': 10,\n",
       " 'imprudently': 1,\n",
       " 'serried': 3,\n",
       " 'zoology': 3,\n",
       " 'sifted': 1,\n",
       " 'orthopaedic': 1,\n",
       " 'dances': 8,\n",
       " 'sailing': 13,\n",
       " 'podnovinsk': 1,\n",
       " 'relaxation': 5,\n",
       " 'conflagrations': 3,\n",
       " 'probably': 147,\n",
       " 'byways': 1,\n",
       " 'ugliness': 1,\n",
       " 'lest': 25,\n",
       " 'inhabitants': 80,\n",
       " 'unfulfilled': 1,\n",
       " 'but': 5653,\n",
       " 'gravid': 1,\n",
       " 'piquant': 1,\n",
       " 'tourist': 1,\n",
       " 'overseas': 6,\n",
       " 'sheet': 29,\n",
       " 'amants': 1,\n",
       " 'shcherbinin': 6,\n",
       " 'pearl': 4,\n",
       " 'unconsciousness': 3,\n",
       " 'suffices': 2,\n",
       " 'beauche': 1,\n",
       " 'stubby': 1,\n",
       " 'tomorrow': 112,\n",
       " 'meant': 113,\n",
       " 'portly': 6,\n",
       " 'rocky': 6,\n",
       " 'switch': 2,\n",
       " 'equally': 77,\n",
       " 'bien': 2,\n",
       " 'drastic': 21,\n",
       " 'offenser': 1,\n",
       " 'trophy': 4,\n",
       " 'memphis': 3,\n",
       " 'associate': 13,\n",
       " 'mb': 13,\n",
       " 'transact': 1,\n",
       " 'amos': 1,\n",
       " 'publish': 4,\n",
       " 'sots': 1,\n",
       " 'blood': 548,\n",
       " 'fraternal': 1,\n",
       " 'unvexed': 3,\n",
       " 'chessplayer': 1,\n",
       " 'introspect': 1,\n",
       " 'molibre': 1,\n",
       " 'decorous': 1,\n",
       " 'nominally': 6,\n",
       " 'pelisses': 1,\n",
       " 'nonconformist': 1,\n",
       " 'gabions': 1,\n",
       " 'genii': 1,\n",
       " 'console': 7,\n",
       " 'rectal': 7,\n",
       " 'dramatic': 10,\n",
       " 'ilarionovich': 6,\n",
       " 'glare': 5,\n",
       " 'about': 1497,\n",
       " 'divine': 34,\n",
       " 'swash': 1,\n",
       " 'assured': 41,\n",
       " 'exporting': 3,\n",
       " 'gentleness': 6,\n",
       " 'reinspected': 1,\n",
       " 'friction': 19,\n",
       " 'commented': 2,\n",
       " 'useless': 48,\n",
       " 'musician': 4,\n",
       " 'afterward': 19,\n",
       " 'lymphatic': 25,\n",
       " 'illogical': 4,\n",
       " 'ak': 3,\n",
       " 'practicing': 2,\n",
       " 'moschcowitz': 1,\n",
       " 'troughs': 1,\n",
       " 'longitudinally': 2,\n",
       " 'restfully': 1,\n",
       " 'periphery': 9,\n",
       " 'times': 236,\n",
       " 'adjustment': 8,\n",
       " 'validity': 4,\n",
       " 'justices': 2,\n",
       " 'completely': 95,\n",
       " 'dozen': 36,\n",
       " 'specialist': 8,\n",
       " 'curiosities': 1,\n",
       " 'portal': 2,\n",
       " 'olmutz': 22,\n",
       " 'permutations': 1,\n",
       " 'operate': 20,\n",
       " 'spectacled': 3,\n",
       " 'deferred': 3,\n",
       " 'cooperation': 21,\n",
       " 'equaled': 2,\n",
       " 'provide': 47,\n",
       " 'generalities': 1,\n",
       " 'store': 12,\n",
       " 'whiskey': 1,\n",
       " 'birthday': 6,\n",
       " 'humours': 1,\n",
       " 'shale': 1,\n",
       " 'supersensitiveness': 2,\n",
       " 'summarise': 2,\n",
       " 'feel': 161,\n",
       " 'influences': 23,\n",
       " 'alluringly': 1,\n",
       " 'colombia': 3,\n",
       " 'collar': 37,\n",
       " 'unpack': 1,\n",
       " 'grunting': 2,\n",
       " 'invention': 9,\n",
       " 'fatale': 1,\n",
       " 'domestics': 2,\n",
       " 'screw': 11,\n",
       " 'inattentive': 2,\n",
       " 'annually': 9,\n",
       " 'plaster': 19,\n",
       " 'emitted': 4,\n",
       " 'alternate': 8,\n",
       " 'gunshot': 6,\n",
       " 'usher': 1,\n",
       " 'gainer': 1,\n",
       " 'allowed': 86,\n",
       " 'advance': 95,\n",
       " 'thompson': 1,\n",
       " 'harmony': 15,\n",
       " 'venae': 1,\n",
       " 'nothingness': 2,\n",
       " 'contrat': 3,\n",
       " 'sarcomatous': 7,\n",
       " 'editions': 17,\n",
       " 'notion': 16,\n",
       " 'plucking': 2,\n",
       " 'honoured': 1,\n",
       " 'cooee': 7,\n",
       " 'averting': 3,\n",
       " 'voice': 462,\n",
       " 'shipwreck': 1,\n",
       " 'charges': 18,\n",
       " 'demerara': 1,\n",
       " 'deformities': 27,\n",
       " 'ossifying': 38,\n",
       " 'enthusiastic': 17,\n",
       " 'humorous': 2,\n",
       " 'forgive': 64,\n",
       " 'chuckling': 1,\n",
       " 'latterly': 9,\n",
       " 'struggling': 20,\n",
       " 'jerkin': 1,\n",
       " 'obsessed': 1,\n",
       " 'waits': 1,\n",
       " 'peal': 2,\n",
       " 'dangling': 4,\n",
       " 'princely': 4,\n",
       " 'decline': 26,\n",
       " 'broadsheets': 11,\n",
       " 'lispingly': 1,\n",
       " 'pounce': 1,\n",
       " 'preliminary': 17,\n",
       " 'brims': 1,\n",
       " 'perpendicular': 1,\n",
       " 'entrance': 59,\n",
       " 'starless': 1,\n",
       " 'belly': 11,\n",
       " 'considerate': 4,\n",
       " 'inhabited': 6,\n",
       " 'ask': 251,\n",
       " 'juniper': 3,\n",
       " 'amphitheater': 2,\n",
       " 'tribune': 3,\n",
       " 'innate': 4,\n",
       " 'adipose': 3,\n",
       " 'jaunty': 3,\n",
       " 'enticed': 1,\n",
       " 'welsh': 2,\n",
       " 'proceed': 18,\n",
       " 'limpet': 2,\n",
       " 'predominate': 10,\n",
       " 'scarlet': 22,\n",
       " 'overstepped': 1,\n",
       " 'parchment': 7,\n",
       " 'crinkled': 1,\n",
       " 'coquettish': 6,\n",
       " 'originates': 23,\n",
       " 'surrounded': 104,\n",
       " 'diagnosed': 16,\n",
       " 'allusion': 7,\n",
       " 'bite': 22,\n",
       " 'readmitted': 1,\n",
       " 'deceive': 13,\n",
       " 'calculations': 4,\n",
       " 'decency': 2,\n",
       " 'disgraces': 1,\n",
       " 'straightforward': 3,\n",
       " 'sophy': 1,\n",
       " 'rustan': 1,\n",
       " 'president': 371,\n",
       " 'sellers': 1,\n",
       " 'heals': 12,\n",
       " 'calcis': 1,\n",
       " 'babes': 1,\n",
       " 'stooped': 17,\n",
       " 'ultimate': 16,\n",
       " 'offences': 2,\n",
       " 'suites': 7,\n",
       " 'lays': 4,\n",
       " 'precision': 14,\n",
       " 'flannel': 3,\n",
       " 'implicate': 12,\n",
       " 'disturb': 15,\n",
       " 'est': 29,\n",
       " 'signer': 1,\n",
       " 'saphrophytes': 1,\n",
       " 'infected': 106,\n",
       " 'reduction': 21,\n",
       " 'chewed': 1,\n",
       " 'ame': 3,\n",
       " 'frees': 1,\n",
       " 'stir': 35,\n",
       " 'liable': 146,\n",
       " 'imprisoned': 12,\n",
       " 'brilliancy': 2,\n",
       " 'sleeper': 3,\n",
       " 'teamster': 2,\n",
       " 'plating': 1,\n",
       " 'fjords': 1,\n",
       " 'economies': 1,\n",
       " 'bustle': 12,\n",
       " 'everted': 9,\n",
       " 'school': 33,\n",
       " 'allegation': 1,\n",
       " 'supervised': 1,\n",
       " 'damning': 2,\n",
       " 'krasnaya': 2,\n",
       " 'affirming': 1,\n",
       " 'sending': 38,\n",
       " 'danger': 124,\n",
       " 'rehearsing': 1,\n",
       " 'literate': 1,\n",
       " 'bulbous': 6,\n",
       " 'balalayka': 6,\n",
       " 'osteosarcoma': 1,\n",
       " 'deficiencies': 1,\n",
       " 'impel': 1,\n",
       " 'horribly': 3,\n",
       " 'excitable': 2,\n",
       " 'kirghiz': 2,\n",
       " 'fort': 19,\n",
       " 'hydatid': 12,\n",
       " 'waved': 29,\n",
       " 'relaxed': 14,\n",
       " 'distracted': 11,\n",
       " 'snowed': 1,\n",
       " 'auctions': 1,\n",
       " 'hesitating': 16,\n",
       " 'mortgages': 7,\n",
       " 'seaport': 1,\n",
       " 'injecting': 5,\n",
       " 'diplo': 1,\n",
       " 'hard': 180,\n",
       " 'mighty': 30,\n",
       " 'granges': 2,\n",
       " 'integrity': 9,\n",
       " 'executed': 40,\n",
       " 'planked': 1,\n",
       " 'degree': 94,\n",
       " 'mexicans': 3,\n",
       " 'secrecy': 18,\n",
       " 'glassy': 2,\n",
       " 'inducing': 13,\n",
       " 'mazuwka': 1,\n",
       " 'stepanovich': 1,\n",
       " 'cured': 8,\n",
       " 'catiche': 10,\n",
       " 'concave': 3,\n",
       " 'hearers': 4,\n",
       " 'speciously': 1,\n",
       " 'robber': 5,\n",
       " 'thereof': 26,\n",
       " 'accredited': 1,\n",
       " 'delivery': 4,\n",
       " 'threshold': 19,\n",
       " 'beast': 26,\n",
       " 'mahan': 1,\n",
       " 'nobles': 9,\n",
       " 'obedience': 14,\n",
       " 'arithmetic': 2,\n",
       " 'andrews': 2,\n",
       " 'crudest': 2,\n",
       " 'squadrons': 5,\n",
       " 'spots': 12,\n",
       " 'stony': 6,\n",
       " 'intimation': 1,\n",
       " 'prospectors': 4,\n",
       " 'discharged': 18,\n",
       " 'checks': 6,\n",
       " 'rd': 10,\n",
       " 'tories': 17,\n",
       " 'bustled': 2,\n",
       " 'clatter': 12,\n",
       " 'smythe': 2,\n",
       " 'exalting': 1,\n",
       " 'vrazhek': 1,\n",
       " 'fused': 6,\n",
       " 'rows': 28,\n",
       " 'unpacked': 1,\n",
       " 'answering': 37,\n",
       " 'dwyer': 2,\n",
       " 'chestnut': 15,\n",
       " 'bidder': 1,\n",
       " 'phagocytosis': 6,\n",
       " 'coroner': 18,\n",
       " 'liar': 6,\n",
       " 'chilled': 2,\n",
       " 'loathe': 1,\n",
       " 'gloat': 2,\n",
       " 'squirrel': 4,\n",
       " 'aligning': 1,\n",
       " 'explosive': 1,\n",
       " 'hardness': 5,\n",
       " 'profiting': 2,\n",
       " 'rounds': 4,\n",
       " 'unsuitable': 8,\n",
       " 'desk': 11,\n",
       " 'responsibility': 41,\n",
       " 'fights': 2,\n",
       " 'burn': 40,\n",
       " 'poky': 1,\n",
       " 'laura': 3,\n",
       " 'thefts': 1,\n",
       " 'cher': 25,\n",
       " 'deacon': 10,\n",
       " 'conjuror': 1,\n",
       " 'scenery': 5,\n",
       " 'porter': 25,\n",
       " 'relieves': 2,\n",
       " 'delighted': 38,\n",
       " 'bold': 38,\n",
       " 'harmonize': 1,\n",
       " 'aix': 2,\n",
       " 'camped': 1,\n",
       " 'destructive': 28,\n",
       " 'testing': 6,\n",
       " 'surveying': 4,\n",
       " 'km': 1,\n",
       " 'rectus': 1,\n",
       " 'livery': 2,\n",
       " 'binder': 1,\n",
       " 'swollen': 58,\n",
       " 'reaffirmed': 1,\n",
       " 'effrontery': 1,\n",
       " 'armfeldt': 12,\n",
       " 'surgical': 66,\n",
       " 'bandy': 6,\n",
       " 'lighter': 9,\n",
       " 'corporal': 29,\n",
       " 'corded': 4,\n",
       " 'inflammatory': 48,\n",
       " 'believe': 183,\n",
       " 'sheriff': 1,\n",
       " 'irreconcilables': 1,\n",
       " 'short': 236,\n",
       " 'fissures': 5,\n",
       " 'repletion': 1,\n",
       " 'bald': 82,\n",
       " 'uncourteous': 1,\n",
       " 'peritoneum': 10,\n",
       " 'komoneno': 1,\n",
       " 'innervation': 4,\n",
       " 'galvanometer': 1,\n",
       " 'decade': 14,\n",
       " 'appreciative': 1,\n",
       " 'hothouse': 3,\n",
       " 'analyzed': 3,\n",
       " 'wet': 60,\n",
       " 'stromilova': 1,\n",
       " 'presnya': 1,\n",
       " 'honneur': 6,\n",
       " 'convulsive': 6,\n",
       " 'bwushed': 1,\n",
       " 'squatted': 5,\n",
       " 'frenzied': 1,\n",
       " 'impetus': 9,\n",
       " 'lech': 2,\n",
       " 'destinies': 1,\n",
       " 'convicted': 6,\n",
       " 'overcome': 32,\n",
       " 'sequestra': 8,\n",
       " 'dusk': 10,\n",
       " 'egotism': 8,\n",
       " 'surprising': 15,\n",
       " 'longer': 232,\n",
       " 'horrified': 16,\n",
       " 'looting': 11,\n",
       " 'curlpapers': 1,\n",
       " 'enamel': 3,\n",
       " 'sensational': 3,\n",
       " 'elbe': 1,\n",
       " 'fields': 77,\n",
       " 'lectures': 6,\n",
       " 'spence': 3,\n",
       " 'irresistible': 22,\n",
       " 'adhesion': 2,\n",
       " 'archduchy': 1,\n",
       " 'bug': 1,\n",
       " 'warranted': 3,\n",
       " 'planet': 3,\n",
       " 'mobs': 5,\n",
       " 'waxed': 1,\n",
       " 'jesters': 1,\n",
       " 'shameful': 10,\n",
       " 'curtseying': 1,\n",
       " 'thereon': 1,\n",
       " 'sodden': 12,\n",
       " 'knows': 102,\n",
       " 'narrators': 1,\n",
       " 'calms': 1,\n",
       " 'bureaucracy': 1,\n",
       " 'egyptians': 1,\n",
       " 'doorpost': 4,\n",
       " 'siam': 1,\n",
       " 'preceding': 18,\n",
       " 'oversea': 1,\n",
       " 'tangled': 8,\n",
       " 'torn': 60,\n",
       " 'nuclei': 4,\n",
       " 'shudders': 1,\n",
       " 'nearest': 50,\n",
       " 'progressivism': 1,\n",
       " 'disturbed': 36,\n",
       " 'womanish': 2,\n",
       " 'grasping': 14,\n",
       " 'he': 12401,\n",
       " 'celebrated': 23,\n",
       " 'perseveringly': 1,\n",
       " 'stone': 72,\n",
       " 'poklonny': 6,\n",
       " 'murky': 1,\n",
       " 'measles': 4,\n",
       " 'immersing': 1,\n",
       " 'dingley': 5,\n",
       " 'taurida': 4,\n",
       " 'agriculturists': 2,\n",
       " 'system': 241,\n",
       " 'absurdity': 6,\n",
       " 'plucked': 6,\n",
       " 'rudolf': 1,\n",
       " 'xxii': 10,\n",
       " 'wheeling': 4,\n",
       " 'clang': 5,\n",
       " 'ass': 2,\n",
       " 'chondromatosis': 2,\n",
       " 'agitations': 2,\n",
       " 'hysterical': 12,\n",
       " 'deranged': 2,\n",
       " 'lapse': 8,\n",
       " 'frola': 3,\n",
       " 'dedicated': 13,\n",
       " 'fusillade': 1,\n",
       " 'axis': 37,\n",
       " 'weady': 2,\n",
       " 'flung': 30,\n",
       " 'hydrophobia': 6,\n",
       " 'neapolitans': 1,\n",
       " 'tompkins': 1,\n",
       " 'antics': 1,\n",
       " 'mikolka': 1,\n",
       " 'darwin': 1,\n",
       " 'horner': 13,\n",
       " 'provisions': 52,\n",
       " 'dentition': 2,\n",
       " 'sofa': 84,\n",
       " 'recalls': 2,\n",
       " 'binding': 18,\n",
       " 'meshchanski': 1,\n",
       " 'aristocracy': 16,\n",
       " 'deathbeds': 1,\n",
       " 'redstone': 1,\n",
       " 'plotted': 2,\n",
       " 'bevelled': 1,\n",
       " 'spirilla': 4,\n",
       " 'speed': 31,\n",
       " 'elated': 4,\n",
       " 'inflamed': 53,\n",
       " 'maneuvered': 2,\n",
       " 'number': 301,\n",
       " 'politely': 12,\n",
       " 'violet': 8,\n",
       " 'pointed': 90,\n",
       " 'presentable': 1,\n",
       " 'frenchman': 102,\n",
       " 'schneider': 2,\n",
       " 'deputation': 9,\n",
       " 'deficient': 6,\n",
       " 'riverbank': 1,\n",
       " 'boycott': 1,\n",
       " 'tax': 84,\n",
       " 'severity': 44,\n",
       " 'geometry': 6,\n",
       " 'andwew': 1,\n",
       " 'immobile': 1,\n",
       " 'manless': 1,\n",
       " 'contagiousness': 6,\n",
       " 'sail': 8,\n",
       " 'wrath': 17,\n",
       " 'endeavors': 3,\n",
       " 'zigzag': 2,\n",
       " 'matches': 7,\n",
       " 'raleigh': 2,\n",
       " 'availing': 8,\n",
       " 'steaminess': 1,\n",
       " 'sleighs': 11,\n",
       " 'eclipsed': 2,\n",
       " 'caustique': 2,\n",
       " 'shatters': 1,\n",
       " 'mozhaysk': 25,\n",
       " 'sensation': 73,\n",
       " 'clothing': 26,\n",
       " 'prolonged': 61,\n",
       " 'straighten': 6,\n",
       " 'soldierly': 2,\n",
       " 'chatting': 2,\n",
       " 'charging': 4,\n",
       " 'neskuchny': 1,\n",
       " 'personified': 1,\n",
       " 'oz': 1,\n",
       " 'pikestaff': 1,\n",
       " 'welt': 2,\n",
       " 'indexes': 3,\n",
       " ...}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dic_words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  编辑距离: ###\n",
    "两个词之间的编辑距离定义为使用了几次插入(在词中插入一个单字母), 删除(删除一个单字母), 交换(交换相邻两个字母), 替换(把一个字母换成另一个)的操作从一个词变到另一个词."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 字母表\n",
    "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n",
    "\n",
    "#返回所有与单词 word 编辑距离为 1 的集合\n",
    "def edits1(word):\n",
    "    n = len(word)\n",
    "    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion\n",
    "               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition\n",
    "               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration\n",
    "               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'pple'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apple = 'apple'\n",
    "apple[0:0] + apple[1:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'aple'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apple = 'apple'\n",
    "apple[0:1] + apple[2:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'aple'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apple = 'apple'\n",
    "apple[0:2] + apple[3:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'appe'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apple = 'apple'\n",
    "apple[0:3] + apple[4:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'appl'"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "apple = 'apple'\n",
    "apple[0:4] + apple[5:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#返回所有与单词 word 编辑距离为 2 的集合\n",
    "#在这些编辑距离小于2的词中间, 只把那些正确的词作为候选词\n",
    "def edits2(word):\n",
    "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "114818"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "e1 = edits1('something')\n",
    "e2 = edits2('something')\n",
    "len(e1) + len(e2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与 something 编辑距离为1或者2的单词居然达到了 114,818 个  \n",
    "优化:只把那些正确的词作为候选词,优化之后edits2只能返回 3 个单词: ‘smoothing’, ‘something’ 和 ‘soothing’"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "P(w|c)求解：正常来说把一个元音拼成另一个的概率要大于辅音 (因为人常常把 hello 打成 hallo 这样); 把单词的第一个字母拼错的概率会相对小, 等等。但是为了简单起见, 选择了一个简单的方法: 编辑距离为1的正确单词比编辑距离为2的优先级高, 而编辑距离为0的正确单词优先级比编辑距离为1的高.一般把hello打成hallo的可能性比把hello打成halo的可能性大。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def known(words):\n",
    "    w = set()\n",
    "    for word in words:\n",
    "        if word in dic_words:\n",
    "            w.add(word)\n",
    "    return w\n",
    "\n",
    "# 先计算编辑距离，再根据编辑距离找到最匹配的单词\n",
    "def correct(word):\n",
    "    # 获取候选单词\n",
    "    #如果known(set)非空, candidates 就会选取这个集合, 而不继续计算后面的\n",
    "    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or word\n",
    "    # 字典中不存在相近的词\n",
    "    if word == candidates:\n",
    "        return word\n",
    "    # 返回频率最高的词\n",
    "    max_num = 0\n",
    "    for c in candcidates:\n",
    "        if dic_words[c] >= max_num:\n",
    "            max_num = dic_words[c]\n",
    "            candidate = c\n",
    "    return candidate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'smoothing'"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#appl #appla #learw #tess #morw\n",
    "correct('smoothig')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'battle'"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "correct('battl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'learn'"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "correct('learww')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'dagsgasdfeg'"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "correct('dagsgasdfeg')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
